# Object Detection and Segmentation

**Paligemma2 3b Mix 224 Jax** (google · Image-to-Text · JAX · 38 downloads · 1 like)
PaliGemma 2 is an upgraded vision-language model based on Gemma 2 that accepts multilingual image and text input and generates text output, designed for vision-language tasks.

**Paligemma2 28b Pt 896** (google · Image-to-Text · Transformers · 116 downloads · 48 likes)
PaliGemma 2 is a Vision-Language Model (VLM) from Google that combines the Gemma 2 language model with the SigLIP vision model, accepting image and text inputs and generating text outputs.

**Paligemma2 10b Pt 896** (google · Image-to-Text · Transformers · 233 downloads · 32 likes)
PaliGemma 2 is a Vision-Language Model (VLM) from Google that integrates Gemma 2's capabilities, accepting image and text input and generating text output.

**Paligemma2 10b Pt 448** (google · Image-to-Text · Transformers · 282 downloads · 14 likes)
PaliGemma 2 is Google's upgraded vision-language model (VLM) that builds on Gemma 2, accepting image and text input and generating text output.

**Paligemma2 10b Pt 224** (google · Image-to-Text · Transformers · 3,362 downloads · 8 likes)
PaliGemma 2 is a vision-language model (VLM) built on Gemma 2. It processes image and text inputs together, generates text outputs, supports multiple languages, and suits tasks such as image and short-video captioning, visual question answering, text reading, object detection, and object segmentation.

**Paligemma2 3b Pt 896** (google · Image-to-Text · Transformers · 2,536 downloads · 22 likes)
PaliGemma 2 is a multimodal vision-language model that combines image and text inputs to generate text outputs; it supports multiple languages and a range of vision-language tasks.

**Paligemma2 10b Mix 224** (google · Image-to-Text · Transformers · 701 downloads · 7 likes)
PaliGemma 2 is a vision-language model based on Gemma 2 that accepts image and text input and generates text output, suitable for a variety of vision-language tasks.

**Paligemma2 3b Mix 224** (google · Image-to-Text · Transformers · 15.23k downloads · 28 likes)
PaliGemma 2 is Google's upgraded vision-language model, combining Gemma 2's capabilities to accept image and text inputs and generate text outputs across a variety of vision-language tasks.
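The PaliGemma mix checkpoints listed above can perform object detection from a plain `detect <object>` prompt: the model answers with four `<locNNNN>` tokens per box, each a coordinate binned to 0–1023 and normalized by 1024, in y_min, x_min, y_max, x_max order, followed by the label. A minimal parsing sketch, assuming that documented output format (the function name is my own):

```python
import re

# PaliGemma "detect" output looks like:
#   "<loc0512><loc0256><loc1023><loc0768> cat ; <loc...>..."
# Four location tokens per box (y_min, x_min, y_max, x_max),
# each binned to 0..1023 and normalized by 1024, then the label.
_DET = re.compile(
    r"<loc(\d{4})><loc(\d{4})><loc(\d{4})><loc(\d{4})>\s*([^;<]+)"
)

def parse_detections(text, width, height):
    """Convert PaliGemma detect output into pixel-space (x0, y0, x1, y1) boxes."""
    boxes = []
    for ymin, xmin, ymax, xmax, label in _DET.findall(text):
        boxes.append({
            "label": label.strip(),
            "box": (
                int(xmin) / 1024 * width,   # x_min in pixels
                int(ymin) / 1024 * height,  # y_min in pixels
                int(xmax) / 1024 * width,   # x_max in pixels
                int(ymax) / 1024 * height,  # y_max in pixels
            ),
        })
    return boxes
```

For example, `parse_detections("<loc0512><loc0256><loc1023><loc0768> cat", 1024, 1024)` yields one box labeled `cat` at (256.0, 512.0, 768.0, 1023.0).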
**Florence 2 Large No Flash Attn** (multimodalart · MIT · Image-to-Text · PyTorch · 73.91k downloads · 16 likes)
Florence-2 is an advanced vision foundation model developed by Microsoft that uses a prompt-based approach and a unified representation to handle diverse visual tasks such as image captioning and object detection.

**Florence 2 Base Ft** (lodestones · MIT · Image-to-Text · Transformers · 14 downloads · 0 likes)
Florence-2 is an advanced vision foundation model developed by Microsoft that uses a prompt-based approach to handle a wide range of vision and vision-language tasks.
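Florence-2's prompt-based interface selects the task purely through a special task token, with some tasks also taking free text after the token. A small sketch of composing such prompts; the task tokens are a subset from Microsoft's model card, and the helper name is my own:

```python
# A subset of Florence-2 task tokens (from the official model card).
# Some tasks take extra free-text input appended after the token.
TASKS_NO_INPUT = {"<CAPTION>", "<DETAILED_CAPTION>", "<OD>", "<OCR>"}
TASKS_WITH_INPUT = {"<CAPTION_TO_PHRASE_GROUNDING>"}

def build_prompt(task, text_input=None):
    """Compose the prompt string Florence-2 expects for a given task."""
    if task in TASKS_WITH_INPUT:
        if not text_input:
            raise ValueError(f"{task} requires a text input")
        return task + " " + text_input
    if task in TASKS_NO_INPUT:
        if text_input:
            raise ValueError(f"{task} takes no text input")
        return task
    raise ValueError(f"unknown task: {task}")
```

For instance, `build_prompt("<OD>")` returns the bare detection prompt, while `build_prompt("<CAPTION_TO_PHRASE_GROUNDING>", "a green car")` grounds the given phrase in the image.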
**Paligemma 3b Ft Widgetcap 224** (google · Image-to-Text · Transformers · 135 downloads · 2 likes)
PaliGemma is a versatile, lightweight vision-language model that combines image and text inputs to generate text outputs, supports multiple languages, and performs well on a range of vision-language tasks.

**Paligemma 3b Ft Scicap 448** (google · Image-to-Text · Transformers · 123 downloads · 0 likes)
PaliGemma is a versatile, lightweight vision-language model that combines image and text inputs to generate text outputs and supports multiple languages.

**Paligemma 3b Ft Cococap 224** (google · Image-to-Text · Transformers · 209 downloads · 1 like)
PaliGemma is a versatile, lightweight vision-language model (VLM) that supports multilingual input and output and suits a variety of vision-language tasks.

**Paligemma 3b Ft Nlvr2 224** (google · Image-to-Text · Transformers · 2,056 downloads · 1 like)
PaliGemma is a versatile, lightweight vision-language model (VLM) that supports multilingual input and output and excels at vision-language tasks such as image captioning and visual question answering.

**Paligemma 3b Mix 448** (google · Image-to-Text · Transformers · 5,488 downloads · 109 likes)
PaliGemma is a versatile, lightweight vision-language model (VLM) built on the SigLIP vision model and the Gemma language model, accepting image and text inputs and generating text outputs.

**Paligemma 3b Ft Nlvr2 448** (google · Image-to-Text · Transformers · 2,350 downloads · 0 likes)
PaliGemma is a versatile, lightweight vision-language model (VLM) that accepts image and text input and generates text output, suitable for a variety of vision-language tasks.

**Paligemma 3b Ft Vqav2 224** (google · Image-to-Text · Transformers · 150 downloads · 2 likes)
PaliGemma is a versatile, lightweight vision-language model that combines image and text inputs to generate text outputs and supports multiple languages.

**Paligemma 3b Ft Docvqa 896** (google · Image-to-Text · Transformers · 519 downloads · 9 likes)
PaliGemma is a lightweight vision-language model developed by Google, built on the SigLIP vision model and the Gemma language model, supporting multilingual image-text understanding and generation.

**Paligemma 3b Pt 224** (google · Image-to-Text · Transformers · 38.40k downloads · 318 likes)
PaliGemma is a versatile, lightweight vision-language model (VLM) built on the SigLIP vision model and the Gemma language model, processing image and text inputs together to generate text outputs.

**Paligemma 3b Ft Scicap 224** (google · Image-to-Text · Transformers · 107 downloads · 0 likes)
PaliGemma is a lightweight vision-language model that combines image and text inputs to generate text outputs, with multilingual, multi-task support.

**Paligemma 3b Ft Ocrvqa 896** (google · Image-to-Text · Transformers · 2,056 downloads · 14 likes)
PaliGemma is a versatile, lightweight vision-language model that accepts image and text input and generates text output, suitable for a variety of vision-language tasks.

**Paligemma 3b Ft Science Qa 224** (google · Image-to-Text · Transformers · 113 downloads · 1 like)
PaliGemma is a versatile, lightweight vision-language model (VLM) that accepts image and text input and generates text output, suitable for a variety of vision-language tasks.
© 2025 AIbase